Predicting link directions via a recursive subgraph-based ranking 
Link directions are essential to the functionality of networks and their prediction is helpful towards
a better knowledge of directed networks from incomplete real-world data. We study the problem 
of predicting the directions of some links by using the existence and directions of the rest of links. 
We propose a solution by first ranking nodes in a specific order and then predicting each link as 
stemming from a lower-ranked node towards a higher-ranked one. The proposed ranking method 
works recursively by utilizing local indicators on multiple scales, each corresponding to a subgraph 
extracted from the original network. Experiments on real networks show that the directions of a 
substantial fraction of links can be correctly recovered by our method, which outperforms either 
purely local or global methods. 

PACS numbers: 89.75.Hc, 89.20.Ff, 89.65.-s



I. INTRODUCTION



Networks provide a powerful abstraction for describing the structures of a wide range of
complex systems [1, 2]. Among them, many belong to the class of directed networks: a set of
nodes connected by links, where each link is associated with a direction pointing from one
node to another. The directions of links reflect the logical order of interaction or
dependence between two nodes. For example, they indicate the directional trend of
information diffusion in an email network [3] and the relations between leaders and
followers in Twitter [4]. Other cases include the dependence of chemical substances in
protein networks [5], the preying relations among animals [6], the hyperlinks connecting
web pages [7], etc. Directions are essential to the functionality of networks: directedness
introduces asymmetric interactions into percolation and epidemic spreading on networks
[8, 9]; directionality also influences the global emergence of collective behaviors [10]
and is critical for synchronization in networks [11].

Unfortunately, data collected from real networks are often incomplete, giving rise to the
study of link prediction, which seeks to predict missing links according to the observed
data [14]. While in the simple case of undirected networks only the possible existence of a
link between two nodes i and j is of concern, the task is more complicated for directed
networks, where the issue of existence and the issue of direction can be considered either
simultaneously or separately: when simultaneously, one examines the existence of both
i → j and j → i; when separately, one first predicts whether a link, regardless of its
direction, exists between i and j and then, if it exists, tries to determine the direction
of that link (i → j, j → i or bidirectional). Whereas previous works on predicting directed
links generally follow the former scheme by fitting a statistical graph model [15], using
local motifs [16], etc., we take the latter scheme in this paper. Specifically, while
existence prediction can be aided by many similarity-based algorithms [18], we focus only
on the essential problem of direction prediction, which remains largely unexplored.

Assuming two nodes are connected, how can we predict the direction of the link between
them? To answer this question, we seek to construct an optimal ordering of nodes such that
a link tends to stem from a node with lower ranking and point to one with higher ranking.
Admittedly, such a ranking-based method inevitably has its drawbacks, mainly due to
directed cycles, as real networks are usually not directed acyclic graphs (DAGs). The
desired property that a link points from a lower-ranked node to a higher-ranked one must be
violated at least once for each directed cycle. In particular, the method cannot predict
bidirectional links, as they are simply directed cycles of length 2. Nevertheless, this
method has its unique virtues: (i) it further reveals the potential of ranking algorithms
as a tool for investigating structural properties of networks, beyond their traditional
role in information retrieval; (ii) we obtain both the predicted directions and a global
ranking describing the directionality of the whole network, bridging the properties on both
microscopic and macroscopic scales; (iii) it may also serve as an effective approximation
algorithm for the linear ordering problem and the maximum acyclic subgraph problem on
directed networks, which are generally NP-hard and have been studied especially for
tournament graphs (every pair of vertices is connected by a single directed link) [19].

The rich structural information woven by directed links has motivated a number of ranking
algorithms for information retrieval. They are designed to derive an ordering of nodes by
leveraging the topological relations in the network, and the ranking criterion is usually
based on a global score. For example, PageRank [20] ranks nodes by the stationary
distribution of the probability of visitation by a random walker mimicking the behavior of
an Internet surfer. Whereas PageRank powers the search engine of Google, its variants have
also been applied to assessing the leadership in social networks [21], the prestige of
journals [22], the ranking of scientists [23] and their papers [24]. Besides, HITS is
another famous ranking algorithm that derives the ranking by a process of mutual recursion
[25]. It defines two scores for each node, namely hub and authority: a node with a high hub
score points to many good authorities, while one with a high authority score receives links
from many good hubs.

However, for the task of predicting the direction between two given nodes, a ranking
completely based on global quantities or processes can hardly capture the local
directionality. Therefore, local indicators, such as in-degree and out-degree, should be
utilized by our ranking algorithm. But the effectiveness of purely local indicators is
weakened by their limited scope of information: degrees are only related to directly
connected nodes, while indirect relations are lost. Thus, local indicators must be combined
and rearranged carefully to form a meaningful global ranking. In this paper, inspired by
the hierarchical nature of disparate complex networks [15, 26-28], we propose a method that
uses local indicators recursively on multiple scales, each of which corresponds to a
subgraph extracted from the whole network. Although local quantities may only give a rough
global sketch, they can reliably capture local properties. Therefore, they should play a
more decisive role as the scale diminishes, due to their increasing fineness for describing
relations in locality. Apart from its predictive purpose, our method may also lead to a
deeper insight into the directional and hierarchical organization of many real networks.




II. PROBLEM DESCRIPTION

Given a directed network G(V, E), where V denotes the set of nodes and E the set of links,
the directions of a portion of links are unknown (denoted by the set E_c), and we are then
asked to predict the directions of these links based on the existence and directions of the
other, known links (denoted by E_n = E − E_c), possibly also using the existence of the
links in E_c.

Among the varieties of possible solutions, we specifically consider resolving this problem
by constructing a special ranking R. Denoting the place of node i in the ranking as R(i)
(a small R(i) means a top ranking), then for any link in E_c connecting i and j, the link
is predicted to be i → j if R(i) > R(j) or j → i if R(i) < R(j). As any two nodes are
assigned to different places in the ranking, predicting two-way links is not considered
here.

Once the directions of the links in E_c are discovered, the performance of a ranking R can
be evaluated by computing its conformity with these links, i.e. the accuracy of direction
prediction, given by

$$ C = \frac{\|\{(i,j) \in E_c \mid R(j) < R(i)\}\|}{\|E_c\|}, \qquad (1) $$

where ||·|| denotes the number of elements in the set and (i, j) denotes a link from i to j
(both i → j and j → i are counted if i and j are found to be reciprocally connected). C is
simply the ratio of correctly predicted directions to the total number of links in E_c.
The maximum value of C is 1, corresponding to a perfect prediction, although it is not
always attainable due to cycles in the network, and a value of 0.5 means guessing the
direction by pure chance. Fig. 1 gives a simple example where R_1 reaches a conformity of
0.5 and R_2 reaches a perfect conformity of 1.

R_1: 1 2 3 6 5 4
R_2: 1 3 2 5 6 4

FIG. 1: A simple network where blue dashed links belong to E_c. Two rankings R_1 and R_2
are given below the network. Ranking R_1 reaches a conformity of 0.5 by giving two opposite
predictions (contradicting 2 → 3 and 6 → 5), whereas R_2 reaches a perfect conformity of 1.



III. METHODS



Our method relies on the assumption that the formation of networks is regulated by an
implicit ranking of nodes, such that links tend to originate from lower-ranked nodes and
point to higher-ranked ones. Such a ranking, if it can be approximately derived from the
observed data, is therefore useful for predicting the directions of missing links. While a
maximum-likelihood method has been recently proposed for extracting this ranking from
friendship networks [29], we take a different approach by combining local indicators with
the hierarchical organization of networks.

Our method is best explained by considering a simple example of a social network, as
illustrated by Fig. 2, which is made up of a few leaders v_1, v_2, v_3 and many followers
u_1, u_2, ..., u_n. Intuitively, leaders should enjoy higher ranking than followers, and
local quantities like degrees are useful for telling the two apart. As leaders are supposed
to have bigger in-degrees and smaller out-degrees, we adopt the degree difference
D^Δ = D^in − D^out as the local indicator, which, as will be demonstrated by experiments,
outperforms either in-degree or out-degree alone. Clearly from the example, the leaders
v_1, v_2, v_3 can be separated from followers by noting their higher degree differences
(D^Δ is positive for v_1, v_2, v_3 while negative for u_1, u_2, ..., u_n).
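As an illustration of this local indicator, the following minimal Python sketch computes the degree difference (in-degree minus out-degree) from a list of directed links. The function name `degree_difference`, the (source, target) edge-list representation and the toy network are assumptions made for this example only.

```python
from collections import defaultdict


def degree_difference(edges):
    """Return {node: in-degree minus out-degree} for directed (source, target) links."""
    diff = defaultdict(int)
    for source, target in edges:
        diff[source] -= 1  # an outgoing link lowers the indicator
        diff[target] += 1  # an incoming link raises it
    return dict(diff)


# Hypothetical toy network: four followers all point to a single leader.
toy_edges = [("u1", "v1"), ("u2", "v1"), ("u3", "v1"), ("u4", "v1")]
toy_diff = degree_difference(toy_edges)
print(toy_diff["v1"], toy_diff["u1"])  # 4 -1
```

In this toy case the leader has a positive degree difference and every follower a negative one, which is the separation described above.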

However, the internal relations among leaders cannot be readily determined in this way, as
the large number of their followers may overwhelm the degrees induced by their
interrelations. In this case, although v_2 has the highest degree difference (D^Δ = 4),
while v_1 and v_3 have the same lower degree difference (D^Δ = 2), v_3 is obviously the
leader of the highest level. This problem can be remedied by leveraging the hierarchical
nature of disparate networks on multiple scales [15, 26-28]: the relations among nodes on a
smaller scale can be determined in a way similar to that on a larger scale. We then explore
the relations among v_1, v_2 and v_3 by extracting the subgraph induced by them, and their
degree differences in the subgraph evidently reveal their ordering: v_3, v_2, v_1. All of
them, of course, are placed higher than u_1, u_2, ..., u_n in the global ranking.

FIG. 2: A network made up of followers u_1, u_2, u_3, ..., u_n and a few leaders
v_1, v_2, v_3. While the stratification between leaders and followers can be determined by
degree differences, the ordering among leaders has to be obtained by extracting their
induced subgraph (red links).

FIG. 3: A schematic illustration of the recursive ranking procedure, where the red section
denotes the set of leaders and the blue section denotes the set of followers. Such division
after reordering occurs on consecutively diminishing scales, until the subgraph contains
too few nodes for subdivision. Here we focus on the process within the initial class of
leaders, and other nodes are masked by grey shades for clarity.

We develop this idea into our algorithm: on a certain 
scale, nodes in the graph are divided into two classes by 
sorting the degree difference of each node. The internal 
orderings of nodes in each class are respectively deter- 
mined in the subgraphs induced by them in a recursive 
fashion, while always placing the class of leaders as a 
whole ahead of the class of followers in the ranking. 

A detailed explanation is presented as follows. Considering a directed network G(V, E), we
examine the network on a certain scale by focusing on the subgraph G_{V'} induced by a
subset of nodes V' ⊆ V. For any subgraph G_{V'} (including G itself), let I(i; V') = l
(1 ≤ l ≤ ||V'||) denote that node i (i ∈ V') takes the l-th place when the nodes of the
subgraph are sorted by their degree difference D^Δ in descending order. Then the set of
nodes V', if large enough, is further divided into the set of leaders V_L(V') and the set
of followers V_F(V') based on this order, with a factor α (0 < α < 1) controlling the
relative size of each, given by

$$ V_L(V') = \{ j \in V' \mid I(j; V') \le \alpha \|V'\| \}, \qquad (2) $$

$$ V_F(V') = \{ j \in V' \mid I(j; V') > \alpha \|V'\| \}. \qquad (3) $$

The relative ranking of node i with respect to G_{V'} is defined recursively as

$$ R(i; V') = \begin{cases} I(i; V'), & \|V'\| \le 1/\alpha, \\ R(i; V_L(V')), & \|V'\| > 1/\alpha,\ i \in V_L(V'), \\ \|V_L(V')\| + R(i; V_F(V')), & \|V'\| > 1/\alpha,\ i \notin V_L(V'), \end{cases} \qquad (4) $$

where the first case corresponds to the trivial situation in which ||V'|| is too small for
subdivision, while the second and third correspond to the node being a leader and being a
follower on a smaller scale, respectively. If it is a leader, its place among the other
leaders is simply used; if it is a follower, we also need to add the total number of
leaders in V_L to its place among the other followers, due to the rule that followers are
always ranked behind leaders as a whole. Such recursive reordering and division, occurring
on consecutively diminishing scales, is schematically illustrated by Fig. 3. For example,
for a network with N = 10,000 nodes and α = 0.6, the set of all nodes is first divided into
6,000 leaders and 4,000 followers, and then the 6,000 leaders are further divided into
3,600 leaders and 2,400 followers on the next scale. Such recursive division continues
until only one node is left in the subgraph.
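The recursive procedure can be sketched in Python as below. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the stable tie-breaking between equal degree differences, the stopping test `len(ordered) <= 1 / alpha` (our reading of the "too small for subdivision" case) and the toy network are all hypothetical. The sketch rescans the full edge list at every level, which is simple but not efficient for large networks.

```python
import math


def degree_difference_order(nodes, edges):
    """Sort nodes by in-degree minus out-degree inside their induced subgraph (descending)."""
    members = set(nodes)
    diff = {v: 0 for v in nodes}
    for source, target in edges:
        if source in members and target in members:  # keep only induced links
            diff[source] -= 1
            diff[target] += 1
    # sorted() is stable, so tied nodes keep their incoming order (an assumption;
    # the text does not specify how ties are broken).
    return sorted(nodes, key=lambda v: -diff[v])


def recursive_ranking(nodes, edges, alpha=0.6):
    """Return the nodes ordered from the top of the ranking to the bottom."""
    ordered = degree_difference_order(list(nodes), edges)
    if len(ordered) <= 1 / alpha:  # too small to subdivide
        return ordered
    # Leaders take the places l <= alpha * size; the tiny offset guards against
    # floating-point products such as 0.29 * 100 landing just below an integer.
    n_leaders = math.floor(alpha * len(ordered) + 1e-9)
    leaders, followers = ordered[:n_leaders], ordered[n_leaders:]
    # Leaders as a whole are always placed ahead of followers.
    return (recursive_ranking(leaders, edges, alpha)
            + recursive_ranking(followers, edges, alpha))


# Hypothetical toy network: leaders v1 -> v2 -> v3 (plus v1 -> v3) and four followers.
toy_nodes = ["v1", "v2", "v3", "u1", "u2", "u3", "u4"]
toy_edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v3")]
toy_edges += [(u, "v1") for u in ("u1", "u2", "u3", "u4")]
toy_edges += [(u, "v2") for u in ("u1", "u2", "u3")]

flat = degree_difference_order(toy_nodes, toy_edges)
ranking = recursive_ranking(toy_nodes, toy_edges, alpha=0.5)
print(flat[:3])     # ['v2', 'v1', 'v3']
print(ranking[:3])  # ['v3', 'v2', 'v1']
```

In this toy case the plain degree-difference order places v2 first because of its many followers, whereas the recursion re-sorts the three leaders inside their own induced subgraph and recovers v3, v2, v1. The place R(i) of a node is its 1-based position in the returned list.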

Finally, the ranking derived by our method is simply 
given by 

R(i) = R(i;V), (5) 

where V is the collection of all nodes in the entire net- 
work. 

All links in E_c are removed before applying this ranking method to the network (preserving
these links as virtual two-way links would produce the same result, because their
contributions cancel in D^Δ). Direction prediction is simple once the ranking is obtained:
each link in E_c is predicted to point from the lower-ranked node to the higher-ranked
node.
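The prediction rule and the conformity measure can be illustrated with a short Python sketch. It assumes a ranking given as a list ordered from the top place downwards; the function names `predict_directions` and `conformity` and the three-node example are hypothetical.

```python
def predict_directions(ranking, pairs):
    """Orient each connected pair from the lower-ranked node to the higher-ranked node."""
    place = {node: k for k, node in enumerate(ranking, start=1)}  # place 1 is the top
    predicted = []
    for a, b in pairs:
        # A link is predicted to point at the node with the smaller place.
        predicted.append((a, b) if place[a] > place[b] else (b, a))
    return predicted


def conformity(ranking, true_links):
    """Fraction of directed links (i, j) whose target j is placed above the source i."""
    place = {node: k for k, node in enumerate(ranking, start=1)}
    correct = sum(1 for i, j in true_links if place[j] < place[i])
    return correct / len(true_links)


# Hypothetical example: ranking c > b > a, and two links whose directions were hidden.
toy_ranking = ["c", "b", "a"]
toy_truth = [("a", "b"), ("c", "b")]
toy_predicted = predict_directions(toy_ranking, toy_truth)
print(toy_predicted)                       # [('a', 'b'), ('b', 'c')]
print(conformity(toy_ranking, toy_truth))  # 0.5
```

Here the first link is recovered correctly while the second is predicted in the opposite direction, giving a conformity of 0.5.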

TABLE I: Dataset description

Dataset                        ||V|| (nodes)   ||E|| (links)
Gnutella P2P network           8,104           26,008
Facebook wall posts network    43,953          262,631
Slashdot zoo network           79,120          515,571
C. elegans neural network      297             2,345


IV. EXPERIMENTS 

Using data from real networks, we first set the parameter of our ranking method by
selecting an optimal α. Then its performance for predicting the direction of links is
compared to ranking by in-degree, out-degree, degree difference D^Δ as well as PageRank.

Four real networks are used for experiments: (i) Gnutella P2P network [30, 31], the
peer-to-peer file sharing network of Gnutella, where one host is connected to another by a
directed link; (ii) Facebook wall posts network [32], the network formed by wall posts of
Facebook users in New Orleans, where a link from user A to user B means A has posted on B's
wall; (iii) Slashdot zoo network [33], the social network of slashdot.org, where a link
from user A to user B means A has marked B as either "friend" or "foe"; (iv) C. elegans
neural network [13, 34], the neural network of the worm C. elegans, where a directed link
corresponds to a chemical synapse along which signals can be passed from one neuron to
another. Their sizes are presented in Table I. Note that the largest weakly connected
component is used here for the Gnutella P2P network and the Facebook wall posts network, as
they are not connected. Multiple links and self-loops are removed if contained in the
original network.

As α specifies the relative sizes of the leader set V_L and the follower set V_F, we seek
to select its value by examining the ranking's global conformity with all one-way links
(links with no reverse link) when our method is applied to the whole network. Denoting the
set of all one-way links by E_g = {(i, j) ∈ E | (j, i) ∉ E}, where (i, j) refers to a link
from i to j, the global conformity is given by

$$ C_g = \frac{\|\{(i,j) \in E_g \mid R(j) < R(i)\}\|}{\|E_g\|}, \qquad (6) $$

which is the ratio of one-way links whose directions are 
in agreement with the ranking to the total number of 
one-way links. 

Figure 4 reports the global conformity C_g under different values of α. The optimal value
of α that maximizes conformity lies around 0.6 for the tested networks, except the neural
network of C. elegans. C_g > 0.92 is reached at α = 0.6 for all networks, indicating the
ranking's high conformity with link directionality. Therefore, for simplicity, we choose
α = 0.6 for the task of direction prediction.

FIG. 4: The global conformity C_g with one-way links versus different values of α. The
optimal value of α occurs around 0.6 for most networks.
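The selection of α by global conformity can be sketched as follows. This is an illustrative outline only: `one_way_links`, `global_conformity` and `best_alpha` are hypothetical names, and `rank_fn` stands for any ranking routine that takes (nodes, edges, alpha) and returns the nodes ordered from the top place downwards. Ties in conformity are resolved here in favour of the larger candidate, which is an arbitrary choice.

```python
def one_way_links(edges):
    """Links (i, j) whose reverse (j, i) is absent from the network."""
    edge_set = set(edges)
    return [(i, j) for (i, j) in edges if (j, i) not in edge_set]


def global_conformity(ranking, edges):
    """Share of one-way links (i, j) whose target j is placed above the source i."""
    place = {node: k for k, node in enumerate(ranking, start=1)}
    links = one_way_links(edges)
    agree = sum(1 for i, j in links if place[j] < place[i])
    return agree / len(links)


def best_alpha(nodes, edges, rank_fn, candidates):
    """Return (alpha, conformity) for the candidate with the highest global conformity."""
    scored = [(global_conformity(rank_fn(nodes, edges, a), edges), a) for a in candidates]
    score, alpha = max(scored)  # on equal scores the larger alpha is returned
    return alpha, score


# Hypothetical toy network with one reciprocal pair (a <-> b) and two one-way links.
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("b", "c")]
print(one_way_links(toy_edges))                       # [('a', 'c'), ('b', 'c')]
print(global_conformity(["c", "a", "b"], toy_edges))  # 1.0
print(global_conformity(["a", "c", "b"], toy_edges))  # 0.5
```

A sweep such as `best_alpha(nodes, edges, rank_fn, [0.1, 0.2, ..., 0.9])` would then mirror the scan over α reported in Fig. 4, given a suitable `rank_fn`.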



We randomly select a portion of links (denoted by the set E_c) out of all one-way links E_g
in a real network, and a ranking's performance for direction prediction is evaluated by
computing the ranking's conformity with these links. Ranking algorithms are run on the
network after removing the links in E_c. Only one-way links are used for evaluation
because, by our criteria, any ranking would be half-right and half-wrong for a pair of
nodes connected by reciprocal links. Besides our method, four other ranking methods, with
the same rule for turning a ranking into predictions, are used for comparison: (i)
PageRank, (ii) ranking in descending order of in-degree, (iii) ranking in ascending order
of out-degree and (iv) ranking in descending order of degree difference D^Δ, which is the
local indicator used in our method.
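The evaluation protocol and the three degree-based baselines might be sketched as follows. The function names, the fixed random seed, the rounding of the hidden-set size and the stable tie-breaking are assumptions of this example, not details stated in the text.

```python
import random


def hide_one_way_links(edges, fraction, seed=0):
    """Split the links into (observed, hidden); only one-way links may be hidden."""
    edge_set = set(edges)
    one_way = [(i, j) for (i, j) in edges if (j, i) not in edge_set]
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    hidden = rng.sample(one_way, round(fraction * len(one_way)))
    hidden_set = set(hidden)
    observed = [e for e in edges if e not in hidden_set]
    return observed, hidden


def baseline_rankings(nodes, edges):
    """Degree-based rankings, each ordered from the top place downwards."""
    in_deg = {v: 0 for v in nodes}
    out_deg = {v: 0 for v in nodes}
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return {
        "in-degree": sorted(nodes, key=lambda v: -in_deg[v]),        # descending
        "out-degree": sorted(nodes, key=lambda v: out_deg[v]),       # ascending
        "degree difference": sorted(nodes, key=lambda v: out_deg[v] - in_deg[v]),
    }


# Hypothetical usage: hide 30% of the links of a ten-link chain, then rank on the rest.
chain = [(k, k + 1) for k in range(10)]
observed, hidden = hide_one_way_links(chain, 0.3)
rankings = baseline_rankings(list(range(11)), observed)
print(len(observed), len(hidden))  # 7 3
```

Each ranking would be computed on `observed` only and then scored by its conformity with `hidden`; repeating the split with different seeds gives the averages over independent runs.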

The algorithm of PageRank is briefly described as follows. The PageRank score of a node i
in the network can be computed by [36]

$$ P_t(i) = c \sum_{j:\,(j,i) \in E} \frac{P_{t-1}(j)}{D_j^{\mathrm{out}}} + \frac{1-c}{N}, \qquad (7) $$

where P_t(i) denotes the probability of visiting node i at time step t by a random walker.
This random walker moves along the links of the network with probability c, corresponding
to the first term on the right-hand side, while jumping to a randomly chosen node with
probability (1 − c), corresponding to the second term. The damping factor c is set to 0.85
as commonly used [20], and we have tested that the performance of PageRank as a direction
predictor is insensitive to this factor. By computing the formula above iteratively, a
steady state can be reached and all nodes are then ranked in descending order of the
probability P(i) in the stationary distribution.
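The iteration can be sketched in Python as below. The update follows the two terms described above (a walk along links with probability c and a uniform jump with probability 1 − c). One detail is an assumption of this sketch: the text does not say how nodes without outgoing links are treated, so their probability is spread uniformly over all nodes here to keep the total equal to 1. The function name and the tolerance and iteration limits are likewise hypothetical.

```python
def pagerank(nodes, edges, c=0.85, tol=1e-12, max_iter=1000):
    """Iterate the PageRank update until the scores stop changing."""
    nodes = list(nodes)
    n = len(nodes)
    out_degree = {v: 0 for v in nodes}
    for source, _ in edges:
        out_degree[source] += 1
    p = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    for _ in range(max_iter):
        # Assumption: mass sitting on nodes without out-links is shared uniformly.
        dangling = sum(p[v] for v in nodes if out_degree[v] == 0)
        new_p = {v: (1.0 - c) / n + c * dangling / n for v in nodes}
        for source, target in edges:
            new_p[target] += c * p[source] / out_degree[source]
        change = sum(abs(new_p[v] - p[v]) for v in nodes)
        p = new_p
        if change < tol:
            break
    return p


# Hypothetical toy network: two nodes pointing to a third one.
star_nodes = ["u1", "u2", "v"]
star_scores = pagerank(star_nodes, [("u1", "v"), ("u2", "v")])
star_ranking = sorted(star_nodes, key=lambda x: -star_scores[x])
print(star_ranking[0])  # v
```

Sorting the nodes by descending score, as in the last lines, gives the ranking that is then fed to the same direction-prediction rule as the other methods.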

FIG. 5: The performance of our method compared with PageRank, in-degree, out-degree and
degree difference D^Δ on real networks: (a) Gnutella P2P, (b) Facebook wall posts, (c)
Slashdot zoo, (d) C. elegans. Conformity C is drawn against the fraction of selected links
among one-way links ||E_c||/||E_g||. The results are obtained by averaging over 10
independent runs and error bars represent standard deviations, which may be too small to be
seen in (a), (b) and (d).



Figure 5 reports the results on four real networks, where conformity C is drawn against the
fraction of selected links among all one-way links ||E_c||/||E_g||. Our method generally
outperforms the other methods, achieving especially high conformity for the Gnutella P2P
and Slashdot zoo networks, which validates its effectiveness for direction prediction. The
performance of our method is also stable: only a small decrease in conformity is observed
even when ||E_c||/||E_g|| reaches 0.5. Meanwhile, D^Δ is evidently a better local indicator
for direction than D^in and D^out. In fact, despite its simplicity, its performance
approaches and even exceeds the performance of our method when many links are removed in
the Slashdot zoo network.



V. CONCLUSION AND DISCUSSION 

In directed networks, directions of links and rankings 
are closely connected. While directions provide rich topo- 
logical information for ranking algorithms, a proper rank- 
ing of nodes also reflects the directional relations among 
nodes. In this paper we explore the latter aspect of this 
connection and we use the presented ranking method to 
predict unknown directions of links in a network, comple- 
menting current progress on the topic of link prediction. 

Directions are related to both local measures and global properties, and the trade-off
between the two is a tough challenge. Relying purely on either local or global measures can
hardly produce effective inference of the directions of links. This difficulty can be
largely resolved by considering the hierarchical structure of real networks. Simple local
measures like in-degree and out-degree, despite their limited fineness at a global scale,
tend to tell us more about the topology as we investigate the network at a smaller scale by
extracting the subgraph induced by fewer nodes. This procedure naturally proceeds
in a recursive fashion, as the hierarchical structure is in
itself self-similar ^UMl- 

Apart from the purpose of direction prediction, our 
method can also be used as an effective heuristic for con- 
structing the maximum acyclic subgraph of a directed 
network. 